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. Supervised classifying of biological samples based on genetic information, (e.g. 

gene expression profiles) is an important problem in biostatistics. In order to find 
both accurate and interpretable classification rules variable selection is indispens- 
able. 

This article explores how an assessment of the individual importance of variables 
| (effect size estimation) can be used to perform variable selection. I review recent 

[T^j effect size estimation approaches in the context of linear discriminant analysis (LDA) 

and propose a new conceptually simple effect size estimation method which is at 
the same time computationally efficient. 

I then show how to use effect sizes to perform variable selection based on the 
misclassification rate which is the data independent expectation of the prediction 
error. Simulation studies and real data analyses illustrate that the proposed effect 
size estimation and variable selection methods are competitive. Particularly they 
^ lead to both compact and interpretable feature sets. 
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1 Introduction 



Modern medical research has been revolutionized by the possibility of characterizing 
diseases at a molecular level using microarrays. Classification of biological samples 
based on their gene expression continues to be a field of active research. See e.g. Pang| 
etaLl ( [2009| ; |Cao et al.| ( pOTT] l; |Xiaosheng and Simon|(pOTT]| and|Shao et al.|(|20TT) . Current 
reviews of the subject can be found in Schwender et al. 1 2008[ l; Slawski et al. < |2008[ ) as 
well as inlKim and Simon|(|2011|). 



In order to develop classifiers which are potentially useful for molecular diagnostics 
it is important to construct them based on a selection of genes (variables) strongly 
associated with the respective class labels (e.g. cancer and healthy tissue). These genes 
possess a large effect size which is generally measured by standardized differences. 

Three distinct but closely related objectives need to be achieved to identify a group of 
genes with high effect sizes ( |Ahdesmaki and Strimmer 2010 Matsui and Noma 2011} : 



(i) to establish a reliable variable ranking, 

(ii) to provide a reasonable estimate of the effect size for each gene, and 

(iii) to find a suitable cutoff point that allows one to disregard (the usually large) 
number of noise-features. 

Problems (ii) and (iii) are the main concerns of the current article. For the ranking 
problem (obj. (i)) I rely on correlation adjusted f-scores (a.k.a. "cat" - scores) introduced 
by |Zuber and Strimmer ( |2009[ ). The cat-score is a t-type statistic which takes correlation 
into account and has been shown to induce a reliable variable ranking even in the 
presence of correlation among the variables. I therefore use cat-scores to obtain effect 
size estimates (obj. (ii)). Based on these estimates a nominal prediction error is computed. 
It is dependent on the number of variables included. Variable selection is then performed 
(ob. (iii)) by determining the number of variables necessary to achieve a certain nominal 
error level. I choose to use linear discriminant analysis (LDA) as a classification method 
- a simple yet very effective approach to linear classification (Hand, 2006). 



The approach of this paper is similar to that of Efron 1 20091 and pabney and Storey 



( 2007| . However in contrast to [Efron ( 2009} my method applies to any number of classes 
and allows empirical null modeling. In contrast to Dabney and Storey ( 2007[ ) it does 
not need a computationally expensive greedy algorithm to select variables due to the 
variable ranking performed beforehand. 

The article is organized as follows: I present basic theory on LDA in chapter |2j then 
I obtain effect size estimates based on cat-scores and compare them to other effect 
size estimation approaches in chapter [3] Notably the Efron (2009) and Matsui and 



Noma ( 2011[ ) methods are presented in a unifying way using cat- scores which sheds 
new light on their similarities. Chapter [4] shows how to perform variable ranking and 
selection combining methodology introduced in chapters [2] and [3j Results of the derived 
variable selection method on simulated and real data are then presented in chapter [5] A 
discussion concludes the article. 
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2 Linear Discriminant Analysis (LDA) and its Misclassification 
Rate 



2.1 Linear discriminant analysis (LDA) and effect sizes 

LDA forms the basis of most classification algorithms currently employed, e.g. Nearest 



Shrunken Centroids commonly abbreviated as NSC, and also known as PAM, see Tib 



shirani et al.| ( |2003[ ), Shrinkage Discriminant Analysis - SDA, |Ahdesmaki and Strimmer 
(2010) - and many more. It starts by assuming a mixture model for the ci-dimensional 
data x 

/(*) = E n k f{x\k), 

k=l 

where each class k is represented by a multivariate normal density 

f{x\k) = {ln)-^\L\-^ x exp{-l(* - ^Tr\x - 

with group-specific centroids ji k and a common covariance matrix E. A sample x is 
assigned to the class yielding the highest LDA discriminant score defined as the log 
posterior probability d k JDA (x) = \og{P(k\x)}. This score can be written as 

4 DA (x) =ri"E- 1 3r- ^ fc T E-V fc + lc.g(7r fc ). (i) 

The standard form of the LDA predictor function shown in Eq.[l]can be transformed 
into a scalar product which is given by 

$ DA (x) = U k * 00l A T S k (x) + Iog(7r fc ). (2) 



See Ahdesmaki and Strimmer | 2010| | for details. In Eq.|2]we have an inner product of Ma- 
halanobis transformed variables (commonly called features ) S (x) and a corresponding 
feature weight vector co( k 'P°°^ given by 



Sk {x) = pr-Uly-m ^ x _ gfc+Pp°o^ (3) 

and 

^ frp ° ol) =P- 1/2 V- 1/2 (^- Vol ) (4) 



respectively. In this equation the pooled mean is calculated as ^ poo i = T,f=l IfVk 
the covariance matrix E is decomposed as: E = V 1//2 PV 1//2 , with a diagonal matrix 
containing the variances V = diag{cr^, . ..,crj} and the correlation matrix P = (pij). 
Remarkably, both a/ fc, P° o1 ) and S k (x) are vectors and not matrices. 
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The decomposition in Eq. [2] shows that aAP° o1 ) gives the influence of the trans- 
formed variables S(x) in prediction. Zuber and Strimmer ( 2009| > have shown that this 
Mahalanobis-transformation leads to an improved ranking of the original variables since 
it removes the effect of correlation. Thus, as in[Ahdesmaki and Strimmer ( 2010[ > the 
feature weights to will serve a measure of variable importance and the terms variables 
and features will be used interchangeably in the following. 

Additionally from Eq. [i]it can be seen that the components of 

w (*,pool) are 

decorre- 

lated a nd standardized differen ces (i.e. effect sizes) between the class k and the "pooled 
class" ( Matsui and Noma[|201l| . This is readily generalized. The effect size vector 
between any two classes k and / is defined as the difference between the two respective 
feature weight vectors a/' c 'P ool) and o/''P ool) 



CO 



(W) - 



CO 



(/:,pool) 



CO 



(Ipool) 



p-l/2y-l/2( 



■Hi) 



(5) 



Note that iv\ k ' l > is up to the scale factor (l/n k + l/n/) -1 ^ 2 equivalent to the cat-score 
vector between the classes k and / on the population level, i.e. assuming known model 
parameters (Zuber and Strimmer 2009} . Hence there is a close relationship between test 
statistics and effect sizes: The effect size is simply a sample size independent version of 
the test statistic. The statistic is denoted by a "cat" subscript in this article, i.e. 

w 2? = (l/n t + l/n I )- 1/2 «W). 

2.2 The misclassification rate of linear discriminant analysis 

In this section I look at an unconditional (i.e. not depending on the data) misclassifi- 
cation error of LDA on the population level. This quantity is called (unconditional) 
misclassification rate in the literature (Da bney and Storey [ 2007 Shao et al. |2011[ >. 

Let be a sample vector drawn from the multivariate normal distribution N(fi k , E) 
associated with class k. In the LDA algorithm it is assigned to the class yielding the 
highest score (Eq.[l]l. Using the scalar product of Eq.EJa misclassification of occurs if 
[to( k 'P°°V] T S k {x k ) + log(7r k ) < max, [co^'P 00 ^} T S,(x ( - k >)+ log(7r,). It is easily verified that 



this is equivalent to the condition 



mm 



[ w W))T[vr-m v -m Uk) _ +i og ( 



[wM]r[ w M] 



< 0. 



Since jcW ~ N(ft k , E) holds for all k S {1, . . . , K}, the expected probability of misclas- 
sifying a sample from class k into a wrong class j ^ k can be deduced from the above 
formula as: 



P(j ^ k\k) = O -min 



[ £ „(W)]T[ a ,(W)]+2lDg( f 



¥k 2J[co^] T [cv^ 1 )} 
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This results in a misclassification rate (total error probability) of 



P(error) = £ P(; ^ fc|fc) x P(fc) = £ O - min V x n k . 

t=i fc=i V ?«/r/.,(W)iTr /f) (w)i / 



fc=l fc=i V 2^/[a;(W)]T[a;(W)] 

(6) 

3 Effect Size Estimation 

For two given classes k and I a feature i with a large corresponding effect size a;. is 
most influential in differentiating between class fc and /. However a "naive" estimation 

of <df'^ (e.g. estimation by plug-in estimates) suffers from the so called "selection bias": 

estimates of co^' are biased upwards in general. For example an estimated effect size of 
1.5 based on f-scores might correspond to a true effect size of 0.7, see Fig. [l] Therefore 

reliable estimates of oJf'^ are needed in order to furnish a good estimate of Eq.^ 
3.1 Three empirical Bayes approaches 



Bayesian approaches are "immune" to selection effects (Dawid 1994||Senn]|2008| . Thus, 



both Efron (2009) and Matsui and Noma | 2011| ) employ empirical Bayes estimates to 



tackle the estimation of effect sizes. 

I am going to present their ideas in a unified way using cat-scores. This will show 
similarities between the two methods that are not readily apparent from studying the 
two original papers. Therefore a both methods are presented in considerable detail to 
clearly demonstrate the conceptual overlap between them. This will also help to indicate 
their respective weaknesses. 

Furthermore, the current section can be read as concise and yet comprehensive review 
of both methods which can be of great help to the interested reader. The empirical Bayes 
estimator presented in section 3.1.3| is an attempt to combine the strengths of both 



approaches while adressing their shortcomings. 

Let k and / be any two classes. For the sake of simplicity the feature index i (i G 
{1, . . . , d}) will be dropped in the upcoming subsections. 



3.1.1 |Efron| ( |2009) 



Efron 



(k I) 

1 2009 ) begins with transforming the statistics co c ^ t into z-scores via a ^-distribution 



with n\ + rtjt — 2 degrees of freedom: 



o- 1 (F ni+nk - 2 (co^: 
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where F n ,+n k -2 denotes the distribution function of a ^distribution with n\ + — 2 

(k I) 

degrees of freedom. He then assumes a prior density g on co ca [ given by the mixture 

s(«2P) = nM^P) + (i - vo) gA {u,£p), (7) 

where Io is a delta-function at and rjo the proportion of genes having a true effect size 
of zero. The alternative group, i.e. the nonzero effect sizes are represented by g A . In the 
following I will in general abbreviate conditioning on the alternative group with an "A" 
subscript. The statistic z is assumed to be distributed as 

I <k,l) T. T f (k,l) ,n 

Together with Eq.[7]this results in the following mixture model for z 

f(z) = W (z) + (l-f ]0 )f A (z) f (8) 
where cp(z) is the normal distribution density and f A is a mixture of the densities 



f A (z) = f <p(z- co^ )g A {co^P ) doo^P 

J —CO 



Eq.[8]is a typical case of two-group mixture model common in multiple testing situations. 
It consists of a theoretical (i.e. no additional parameters) "null" model fo = cp and 
an alternative component f A from which the "interesting" cases are assumed to be 
draw n (|Efron 2008[ >. In order to present the ideas of both Matsui and Noma ( |2011| | 



and lEfron 1 2009 1 in a unified fashion I start with computing the posterior density 

conditioned on the alternative, i.e. f{cof a P |z, z e "alternative") = f{cof a P |z, cvfjp ^ 0). 
As introduced above the "A" subscript indicates conditioning on the alternative, so that 

/^C^cat l z ) = fi^caP \ z ' z ^ "alternative"). Finally, using Bayes' rule this density can 
be computed as 

f r,,(V)\ 9 \ fA(z\co y cat ')-g A {co K cat ') 

J A (Wert \Z) - jr-T-r 

JA{Z) 

= exp(o>Spz - log{/ A (z)/<p(z)})[exp{-(a;S ) ) 2 /2)}]^(a;S ) ) . 

(k I) 

It has the form of a natural exponential family with natural parameter co cai , suf- 
ficient statistic z and cumulant generating function log{f A (z) / <p(z)} = log{[(l — 
fdr(z))/fdr(z)]} • 770(1 - Vo)}, where 

fdr(z) = P("null"|z) = = rj f j^ (9) 
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is the local false discovery rate (Efron 2008} . Conditional on the alternative this leads to 
an effect size estimate of the simple form 



E A (a> 



,(W)| 



(l/n / + l/n,) 1/2 £log 



1 - fdr(z) 
fdr(z) 1 



Since by Eq.[9]the relationship P("alternative"|z) 
the unconditional effect size estimate is: 



l-P("nuU"|z) 



m 

-- i 



(10) 

fdr(z) holds, 



e((o<M\z 



-{l/ ni + l/n k ) 1/2 -\og 



E A {a>M\z}{l-{dx(z)} 
1 - fdr(z) rjQ 



fdr(z) 1-tjQ 



{l-fdr(z)}, 



which after some further calculations becomes 



E(a>M\z 



(l/n z + l/n fc ) 1/2 -log{fdr(z)}. 



(11) 



(12) 



Note that if one used an empirical null N(0, a 2 ) with estimated c as null density fo 
the connection to the natural exponential family would be lost. Then both the natural 
parameter and the sufficient statistic would depend on a. 

Unfortunately in this case the elegant formula p2| no longer holds. This basically is 
the only downside of Efron's approach: It is conceptually simple and computationally 
efficient but it is not possible to include an additional variance parameter in the null 



model without "destroying" Eq. 12 



3.1.2 Matsui and Noma 120111 



Matsui and Noma ( [2011 1 introduce empirical null modeling into the approach of Efron 



(2009) via an empirical Bayes method. They start with a similar z-score transform. 
However, as a starting point absolute values are used: 



0> 



-l 



1 - 2 • 1 - F n 



ii+n k -2 



LO 



(W) 

cat 



Additionally only a prior on the absolute non-null effect sizes gA 
The non-null z have the conditional density 



co. 



cat 



is assumed. 



LO 



(hi) 

cat 
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CO. 



(W) 

cat 



V 



LO 



(k,l) 
cat 



The variance function V and the prior g& are estimated from the data. As in Efron ( 2009} 
they also assume a two-group mixture model for the z-scores: 



/(z) = r] f 



Ho 



oo 



(l-Vo)f A (z). 
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The null density is (in contrast to Efron) an empirical null, i.e. mean and variance 
are estimated from the data: /o(z) = cp ((z — }Iq)/o~q). The alternative density /a is 
computed as 



CO. 



(Kl) 

cat 
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to. 



(Kl) 

cat 



V 



v 



CO. 



(Kl) 

cat 



8 A 



8a 



to 



(Kl) 
cat 



to. 



(Kl) 
cat 



CO. 



(Kl) 
cat 



CO. 



(Kl) 
cat 



The application of Bayes' rule gives a posterior expectation of 



nately not as simple as Eq. 10 



CO. 



(Kl) 
cat 



CO. 



(Kl) 
cat 



CO. 



(Kl) 
cat 



8 A 



CO. 



(Kl) 

cat 



/a(z) 



CO. 



which is unfortu- 



(Kl) 
cat 



<^r»t — 2 



to 



(Kl) 
cat 



V to. 



8 A 



CO. 



(Kl) 
cat 



/a(z) 



CO. 



(Kl) 

cat 



The statistic 



co. 



(Kl) 

cat 



is then transformed back into a absolute value effect size: 



co 



(Kl) 



(l/ni + l/n k ) 



1/2 1 



-l 

- n,+n k -2 



a' 



(W) 
cat 



As in Eq. 12 the final effect size estimate is 



a; 



(Kl) 



co 



(KD 



z (l-fdr(z)). 



(13) 



In contrast to Efron's method the approach of Matsui and Noma ( 2011[ > allows 
empirical null modeling and thus leads to better effect size estimates in general as 
Matsui and Noma ( |2011 > also convincingly show in their article. 

However this increased accuracy comes at price. The estimation of variance function 
V can take up to two hours. Furthermore it has to be estimated for every number of class 
samples and ti; separately. This makes cross-validation based assessment of predictive 
accuracy extremely time consuming. Additionally even if V has been computed for 
fixed and ri\ the estimation of the final effect size will take up to a couple of minutes. 

In summary while |Matsui and Noma ( 2011[ > provide a method that is superior to 
Efron's method in terms of bias, it also is computationally very demanding. 
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3.1.3 A simple empirical-Bayes approach 

In this section I will derive another more heuristic approach to the reliable estimation 
of effect sizes. It tries to combines the strengths of Matsui and Noma ( |2011| | and |Efron| 



(2009). Empirical null modeling will be included, it will be computationally tractable 



and provide sufficient accuracy. 

Observe that in non-empirical Bayes frameworks reliable estimation of effect sizes is 
generally achieved by shrinking initial estimates of statistics playing th e same role as 



£*>cat . F° r example in the popular PAM algorithm ([Tibshirani et al. 



2003) the estimated 



^-scores are shrunken using a parameter A estimated by cross validation. 

Therefore an appropriate adaptive shrinkage of the original should provide us with 
reasonable effect size estimates. As it turns out, this adaptive shrinkage can easily be 
achieved by employing false discovery rates. 

The first step in my heuristic approach to achieve a shrinkage of is the assump- 
tion of a two component mixture model on the effect sizes: 

ffrSP) = 1ofo(*>SP) + (1 - Vo)f A (co%P), (14) 

leading to corresponding fdr estimates of Eq. [9] Assuming a centered null distribution, 
we can now make use of the "naive" estimates Ea (co^^j = coW) and correspondingly 

Eq (co^j = (since fo is centered). The subscript indicates a conditioning on the 

null distribution, E (co {k ' l A = E (co^ \wW) G "null"). It now holds by the law of 
total probability and Eq.|9]that the effect size is given by 



t„W ) = (l/m + l/n k y^{ E [a,™ ) • P [w^> G "nuU'> cat 



+ E A (a4a?) • P (aSt* G "alternative" |o; cat 
= (l/m + l/n k ) 1/2 E A (co^P) • P (co^P G "alternative"!^ 



= E A (^).(l-fdr(o;^)) 

= ^)(l-fdr(o;^)). (15) 

Eq. |l5]is very similar to Eq. 13 and Eq. 11 however no full Bayesian posterior is com- 
puted. Instead simple non-Bayesian estimates for the expectations in the two groups 
model Eq.|l4]are employed. This makes the implementation of Eq. [15] computationally 
efficient. 

There is an obvious downside though: Large (with respect to their absolute value) 
statistics usually have a high fdr value close to 1. Therefore they are hardly shrunken 
at all although their effect size is usually grossly overestimated. Thus it is necessary to 
impose a minimum shrinkage. From the results of the real data analysis in table 1 of 
Matsui and Noma ( |2011| it can easily be seen that the empirical Bayes method that these 
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authors apply imposes a shrinkage of at least 50% on the top 5 test statistics. I therefore 
also set the minimum shrinkage to 50% leading to the formula 



CO 



(k,l) 
fdr 



CO 



(W) 



mm 



{0.5; [1 



fdr(a;£ ;) )]} . 



(16) 



I call this fdr-effect size estimation (fdr-effect) and abbreviate ^1 — fdr(a^at^)) 
by cujjffa. Note that a fdr cutoff of 50% is conceptually very close to Higher Criticism 



Thresholding (Klaus and Strimmer 2012} . 

Perhaps surprisingly, in next section it will be shown that it is competitive with 
regard to the attained accuracy even though no sophisticated posterior estimates are 



used. The adaptive shrinkage performed in Eq. 16 can be interpreted as being in between 
the full empirical Bayes approaches of Efron ( 2009} or |Matsui and Noma ( 2011} and soft 
thresholding using a single shrinkage parameter for all statistics as in Tibshirani et al. 
( |2003| . 



3.2 Evaluation of Effect Size Estimation Methods on Real and Simulated 
Data 

A comparison of effect size estimation methods using simulated data is shown in 
Fig. [l] Specifically I compare the effect size estimation using "naives" approaches 
(simple cat and t -scores) and the more sophisticated ones described in the previous 
section abbreviated as MatsuiNoma, Efron and fdr-effect respectively. For the methods 
MatsuiNoma and Efron I use the implementations offered by the authors, for fdr-effect I 
perform cat-score and fdr estimation using the R-packages st and f drtoo l (jStrimmer 



2008a j). In the real data analysis displayed in Fig. |2]the package locfdr (Efron 2004 



2007} 2008 1 is applied since this allows a straightforward use of an empirical null as it 



has been suggested in Matsui and Noma| (|2011} and | Efron| (|2004} for this data set. 

I follow closely the setup used in |Smyth| ( |2004} ,ppgen-Rhein and Strimmer ( 2007} 
and Zuber and Strimmer ( 2009} to simulate gene expression data. The parameters are 
chosen in such a way that effect sizes between 1 and 3 are obtained which roughly 



corresponds to the range considered in the simulation studies of Matsui and Noma] 
( pOTT) . 

The number of statistics was fixed atd = 1000 with 200 statistics designated to be 
differentially expressed. The variances across genes were drawn from a scale-inverse- 



chi- square distribution Scale-inv-^ 2 (do, Sq 



With Sq 



1 and do = 1, i.e. the variances 



vary moderately from gene to gene. Furthermore, the difference of means for the 
differentially expressed genes (1-200) were drawn from a normal distribution with 
mean zero and the gene-specific variance multiplied with a scale factor set to 0.3. For 
the non-differentially expressed genes (201-1000) the difference was set to zero. The 
data were generated by drawing from group-specific multivariate normal distributions 
with the given variances and means, employing a block diagonal correlation structure 



intended to mimic gene expression data. This structure was generated as in Guo et al. 
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Estimation of Effect Sizes on Simulated Data 



True Effect Size 

MatsuiNoma 

fdr — effect 

Efron 

Cat 

t-score 



260 



400 



600 
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1000 



Figure 1: Comparison of effect sizes on simulated data following the Smyth ( 2004[ ) model 



(2007) with block size 100 and block entries equal to 0.9^~^. Furthermore, the sample 
sizes ri\ and n.2 are equal with n\ = = 8. 

The effect size estimates are plotted in Fig. [l] according to their rank. It is important 
to note that this does not tell us whether the respective ranking is correct. Thus, even 
though the effect size estimates of the cat-score and an ordinary f-score are very similar, 
this does not mean that their induced ranking is comparable. 

It can be seen that fdr-effect and MastsuiNoma yield good results, while Efron's 
method has a higher bias for effect sizes up to 1, a phenomenon already observed by 
Matsui and Noma ( 2011[ ). The "naive" approach using cat-scores is far off for effect sizes 



up to 1.5. However all methods overestimate large effect sizes. It follows that variable 
selection methods relying on effect size estimates will generally have a tendency of 
choosing only a relatively small number of variables in data sets with large effects. 

This is in fact a phenomenon already observe d by jAhdesmaki and Strimmer ( |2010] > 
for the Efron algorithm applied to the Singh et al. ( 2002[ ) prostate cancer gene expression 
data. This data consists of gene expression measurements of d = 6033 genes for n = 102 
patients, of which 52 are cancer patients and 50 are healthy. It has already been analyzed 
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Estimation of Effect Sizes on Singh Data 



MatsuiNoma 
fdr-effect 
Efron 
Cat 

t-score 
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Figure 2: Comparison of effect sizes for the Singh et al. |2002) data 



in Efron ( 2009[ ) and Matsui and Noma| ( 2011[ ). Fig. |2]shows the analysis results. As in 
the simulated data the "naive" approaches are far off, while Efron and MatsuiNoma are 
quite similar. Note however, that MatsuiNoma gives significantly lower estimates of 
large effect sizes than Efron a phenomenon already noted in Matsui and Noma ( 2011| ). 
The fdr-effect method yields similar results to MatsuiNoma for large effect sizes but 
arrives at zero estimates much faster than MatsuiNoma and Efron. In conclusion all 
empirical Bayes methods considered seem to give sound results here, while the empirical 
methods are probably grossly overestimating the effect sizes. 



4 Variable Selection and Estimation of the Prediction Rule 

4.1 Estimation of the prediction rule and local false discovery rates 

For the estimation of the prediction rule (Eq. |2]) I mostly employ James-Stein-type 
estimators as in shrinkage discriminant analysis - SDA, Ahdesmaki and Strimmer ([2010 ). 
The group centroids ji k are estimated by the empirical means, for the correlations P the 
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ridge-type estimator from Schafer and Strimmerj (12005 ( is used and the variances V are 



estimated by the shrinkage estimator from Opgen-Rhein and Strimmer ( 2007[). Finally 
the proportions are obtained by using the frequency estimator from Hausse r and| 
Strimmer ( 2009| . For SDA I employ the implementation provided by the R package sda. 
The local false discovery rates used in the fdr-effect approach are learned by using the 
Grenander density estimator and truncated maximum likelihood for the empirical null 
Strimmer (2008b). As in chapter[3]the implementation offered by the R package 



as m 



f drtool is employed. 



4.2 Variable ranking and selection 
4.2.1 Variable ranking 

Before being able to select variables a variable ranking needs to be established (obj. (i)). 
In the two class case this is straightforward since the feature weight vector for class 
one u>i corresponds to the effect size vector a/ 1,2 ) and the feature weight vector for 
class two u>2 to the effect size vector — a/ 1,2 ). Thus variables can be ranked according 
to the absolute value of a/ 1,2 ). In the the case of multiple classes the situation is more 
complicated. The feature weight vectors of the different classes need to be summarized 
in a certain way to obtain the importance of each feature i in class prediction. Here I use 
the summary statistic S; proposed by ( |Ahdesmaki and Strimmer |2010[ > and given by 



K 

E 

fc=l 



s^Lf-S 001 ^ ' < 17 > 

> ! ° o11 - (l/ttfc — l/n) _1 ^ 2 o;- ,: ' p . Since false discovery rates are generally 



where w 



cat,! 



assumed to be monotone, Eq. 15 shows that using fdr-effect effect size estimates w^ 00 ^ 



would produce the same ranking as the cat-scores if they were used instead of <x> 



(/c,pool) 



cat 



to compute S, in Eq. 17 



4.2.2 Misclassification rate based variable selection 

Having obtained estimates S)^ of ctff^P and of Ttk we can now compute an estimate 

of the missclassification rate using Eq. 6 Let wlz) (t) be the vector of the t top-ranked 
variables according to the ranking induced by the vector S of all statistics S, given by 



Eq. 17 We then have an estimate of the misclassification rate which depends on t: 

K 

P(error)(f) = £ P(^ k\k)(t) x P(Jfc) 
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Efron| ( [2009| > perf orms feature selection by choosing a level a = 0.05 as a target 



missclassification rate for the estimate in Eq. 18 Although one could view a as a tuning 
parameter I follow his suggestion in this regard since experiments with lower results 
only lead to very large feature sets showing only to a negligible improvement of the 
classification performance. 

After the target error a has been set a feature threshold t* is obtained by including 
as many features as necessary to reach it, i.e. P(error)(£*) = oc. Since usually a lot of 
features are shrunken to zero, it is possible that the target error can not be reached. Then 
all the features will be included. This however is extremely unlikely to happen in real 
high dimensional data analysis. Finally all features fulfilling S, > S* are included in the 
classifier. 



5 Analysis of Real and Simulated Data 



5.1 Simulations 

In this section I compare variable selection based on the misclassification rate (MR) 
with several other state of the art thresholding variable selection approaches, namely 
false-non discovery rate (FNDR) thresholding (Ahdesmak i and Strimmer||2010[), H igher 
Criticism (HC) thresholding (Donoho and Jin 2008| l and the RAM algorithm (Tibshirani 
et all 120031). As a base line classifier I also include the results of classification with all 



features, i.e. performing no variable selection. 

The simulations follow closely the setup of Witten and Tibshirani (2011). A training 
set of size 100 and a test set of 1000 samples are created with a dimension of d = 500 
variables. In total 25 runs of each simulation setup are performed. 



5.1.1 Simulation setup 1 

In this setup there are four classes with equal probability (0.25) no correlation and unit 
variance. 25 features are differentially expressed in each class with an effect size of 
0.7, yielding a total number of 100 differentially expressed features. Since there is no 
correlation we perform Diagonal Discriminant Analysis (DDA), i.e. LDA with identity 
covariance £ = 1^. The results are displayed in Tab.|l] 

It can be seen that thresholding the summary statistic S (Eq. [l7| ) by false-non dis- 
covery rates or Higher Criticism yields hardly any significant features in most runs. 
Consequently the estimated prediction errors are quite high. 

Misclassification rate based feature selection as well as RAM however identify fea- 
tures useful for classification. This indicates that "analytical" thresholding methods, 
which do not rely on the optimization of a tuning parameter may not work reliably 
when the effect sizes are small. 
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Table 1: Prediction errors and number of selected features for simulation setup 1, the 
number in the round brackets is the estimated standard error over 25 runs, the true 
number of differentially expressed features is 100. 



Method 


Prediction Error 


Features 


DDA-MR 


0.1077 (0.0177) 


156.48 ( 64.70) 


DDA-FNDR 


0.2482 (0.1272) 


39.24 ( 23.72) 


DDA-HC 


0.1880 (0.0626) 


152.32 (193.48) 


PAM 


0.0923 (0.0163) 


253.6 (116.26) 


DDA-ALL 


0.1555 (0.0180) 


500 



Table 2: Prediction errors and number of selected features for simulation setup 2, the 
number in the round brackets is the estimated standard error over 25 runs. The true 
number of differentially expressed features is 200. 



Method 


Prediction Error 


Features 


LDA-MR 


0.000 (0.000) 


63.16 (7.215) 


LDA-FNDR 


0.000 (0.000) 


60.96 (6.567) 


LDA-HC 


0.000 (0.000) 


85.04 (8.677) 


PAM 


0.088 (0.018) 


294.0 (69.43) 


LDA-ALL 


0.093 (0.014) 


500 



5.1.2 Simulation setup 2 



In this simulation I use a |Guo et al. ( |2007[ | type block correlation with 5 blocks of size 



100 x 100. As in section [3] each block entry is given by 0.9^~^, thus we have some highly 
correlated variables within blocks but variables in different blocks are independent. 
Note that Witten and Tibshirani ( 2011[ > report to use an entry size of 0.6. This is 



probably a misprint since my results obtained for PAM are quite similar to the ones 
reported in their article, while for 0.6 the error of PAM only about 5 %. 

There are two classes with equal probability (0.5) and 200 features are differentially 
expressed with effect size 0.6, all of them are attributed to class 2. Since there is correlation 
in this setting I perform LDA. 

It can be seen in Tab. |2]that all feature selection methods except for PAM, which does 
not take correlation into account, perform quite well here. 
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5.2 Gene expression data 



In Ahdesmaki and Strimmer (2010) the relative effectiveness of the FNDR and HC 
thresholds to select relevant genes in shrinkage discriminant analysis applied to gene 
expression data has already been compared. I followed their setup here and analyzed 
four clinical gene expression data sets related to prostate cancer ([Singh et al.| [2002) , 
B-cell lymphoma (Alizadeh et al. 2000), colon cancer (Alon et al. 1999} , and brain cancer 
( |Pomeroy et al.[|2002} ~ 

Specifically, balanced 10-fold cross-validation with 20 repetitions was performed to 
obtain error estimates and their standard deviations. The number of selected features is 
inferred by single run of the respective variable selection method on the whole data set. 
Only for PAM this was repeated several times, since the number of selected variables 
selected by this algorithm varies considerably between several runs in a row on the same 
data set. 

In Tab. [3] it can bee seen that my approach has a performance similar to the other 
approaches. Interestingly, the MR approach shows a more "adaptive" feature selection, 
leading to appropriate feature sets for each problem. In the brain data set a very compact 
set of features is selected yielding a prediction error which is nonetheless in the range 
of the other approaches. The same is true for the Lymphoma and Colon data sets. This 
demonstrates that a variable selection method based on effect sizes leads to compact 
and yet effective molecular signatures. 



6 Discussion 

In this paper I reviewed and extended statistical techniques related to effect size es- 
timation in linear classification and showed how to use them for variable selection. 
The fdr-effect method proposed for effect size estimation has been shown to work as 
well as competing approaches while being conceptually simple and computationally 
inexpensive. It therefore successfully unites the strengths of the approaches presented in 
Efron| ( |2009} and |Matsui and Noma| | |2011| ). 



Additionally, I gave a unified treatment of the effect size estimation approaches 
presented in these two papers elucidating similarities not apparent when considering 
the original publications only. 

Variable selection by minimizing the misclassification rate has been somewhat ne- 
glected in the literature but I showed in accordance with Dabney and Storey (2007), 



Efron ( 2009[ ) and Matsui and Noma ( 2011} that it is indeed very well suited for real world 



problems. In addition, it is also much more intuitive than selecting a non-interpretable 
regularization parameter as for example in the PAM algorithm and leads to compact 
and interpretable feature sets. 

In this work I proposed a conceptually simple and competitive variable selection 
algorithm that gives priority to genes with large effect sizes and is thus easy to interpret. 



This has been achieved by extending and combining the ideas of Dabney and Storey 
d2007) , [Efron] ( [20091 ) and |Matsui and Noma| ( |20TT| ). 
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Table 3: Analysis of four cancer gene expression data sets with shrinkage discriminant 
analysis. The number of selected features are determined by a single feature selection 
run on the whole data set. 



Data / Method Prediction Error Selected Variables 



Prostate (d = 6033, n = 102, K — 2) 



LDA-MR 


0.0630 (0.0050) 


1 O A 

134 


LDA-FNDR 


0.0550 (0.0048) 


1 ^1 


LDA-HC 


0.0497 (0.0045) 


116 


PAM 


0.0850 (0.0061) 


1 72-377 


Lymphoma (d = 4026, n = 62, K = 3) 




LDA-MR 


0.0211 (0.0039) 


J4 


LDA-FNDR 


0.0036 (0.0018) 


392 


LDA-HC 


0.0000 (0.0000) 


345 


PAM 


0.0234 (0.0041) 


2796 - 2383 


Colon (d = 2000, n 


= 62,K = 2) 




LDA-MR 


0.1291 (0.0093) 


28 


LDA-FNDR 


0.1278 (0.0088) 


168 


LDA-HC 


0.1233 (0.0087) 


122 


PAM 


0.1160 (0.0921) 


13-23 


Brain (d = 5597, n 


= 42,K = 5) 




LDA-MR 


0.1628 (0.0126) 


56 


LDA-HC 


0.1417 (0.0108) 


131 


LDA-FNDR 


0.1525 (0.0120) 


102 


PAM 


0.2023 (0.0118) 


42-5587 



High expectations are associated with the promise of a personalized medicine promis- 
ing tailored treatments based on genetic and other information of the patient. In order 
to develop molecular diagnostics guiding these treatments, statistical approaches for 
effective and interpretable classification are indispensable. 

The methodology presented in this article provides interpretability and applicability 
for biological study and medical use. Reliable effect size estimates allow one to iden- 
tify genes having discriminative power while variable selection based on these effect 
size estimates allows the selection of the most important genes for the construction of 
classification algorithms. 
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